The Home Equity dataset (HMEQ) contains baseline and loan performance information for 5,960 recent home equity loans. The target (BAD) is a binary variable that indicates whether an applicant has ultimately defaulted or has been severely delinquent. This adverse outcome occurred in 1,189 cases (20 percent). 12 input variables were registered for each applicant.
BAD: 1 = Client defaulted on loan, 0 = loan repaid
LOAN: Amount of loan approved.
MORTDUE: Amount due on the existing mortgage.
VALUE: Current value of the property.
REASON: Reason for the loan request. (HomeImp = home improvement, DebtCon= debt consolidation which means taking out a new loan to pay off other liabilities and consumer debts)
JOB: The type of job that loan applicant has such as manager, self, etc.
YOJ: Years at present job.
DEROG: Number of major derogatory reports (which indicates a serious delinquency or late payments).
DELINQ: Number of delinquent credit lines (a line of credit becomes delinquent when a borrower does not make the minimum required payments 30 to 60 days past the day on which the payments were due).
CLAGE: Age of the oldest credit line in months.
NINQ: Number of recent credit inquiries.
CLNO: Number of existing credit lines.
DEBTINC: Debt-to-income ratio (all your monthly debt payments divided by your gross monthly income. This number is one way lenders measure your ability to manage the monthly payments to repay the money you plan to borrow.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme()
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn import metrics
from sklearn.metrics import confusion_matrix, classification_report,accuracy_score,precision_score,recall_score,f1_score
from sklearn.preprocessing import OneHotEncoder
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
import scipy.stats as stats
from sklearn.model_selection import GridSearchCV
import warnings
warnings.filterwarnings('ignore')
# Reading the data file
home_equity_data = pd.read_csv('hmeq.csv')
# Making a deep copy of the dataset to prevent changes to the original file
data = home_equity_data.copy()
# Viewing the first five rows of the data
data.head()
| BAD | LOAN | MORTDUE | VALUE | REASON | JOB | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1100 | 25860.0 | 39025.0 | HomeImp | Other | 10.5 | 0.0 | 0.0 | 94.366667 | 1.0 | 9.0 | NaN |
| 1 | 1 | 1300 | 70053.0 | 68400.0 | HomeImp | Other | 7.0 | 0.0 | 2.0 | 121.833333 | 0.0 | 14.0 | NaN |
| 2 | 1 | 1500 | 13500.0 | 16700.0 | HomeImp | Other | 4.0 | 0.0 | 0.0 | 149.466667 | 1.0 | 10.0 | NaN |
| 3 | 1 | 1500 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 0 | 1700 | 97800.0 | 112000.0 | HomeImp | Office | 3.0 | 0.0 | 0.0 | 93.333333 | 0.0 | 14.0 | NaN |
# Viewing the last five rows of the data
data.tail()
| BAD | LOAN | MORTDUE | VALUE | REASON | JOB | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5955 | 0 | 88900 | 57264.0 | 90185.0 | DebtCon | Other | 16.0 | 0.0 | 0.0 | 221.808718 | 0.0 | 16.0 | 36.112347 |
| 5956 | 0 | 89000 | 54576.0 | 92937.0 | DebtCon | Other | 16.0 | 0.0 | 0.0 | 208.692070 | 0.0 | 15.0 | 35.859971 |
| 5957 | 0 | 89200 | 54045.0 | 92924.0 | DebtCon | Other | 15.0 | 0.0 | 0.0 | 212.279697 | 0.0 | 15.0 | 35.556590 |
| 5958 | 0 | 89800 | 50370.0 | 91861.0 | DebtCon | Other | 14.0 | 0.0 | 0.0 | 213.892709 | 0.0 | 16.0 | 34.340882 |
| 5959 | 0 | 89900 | 48811.0 | 88934.0 | DebtCon | Other | 15.0 | 0.0 | 0.0 | 219.601002 | 0.0 | 16.0 | 34.571519 |
# Retriveing the shape of the data
data.shape
(5960, 13)
Observations:
# View column information
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5960 entries, 0 to 5959 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 BAD 5960 non-null int64 1 LOAN 5960 non-null int64 2 MORTDUE 5442 non-null float64 3 VALUE 5848 non-null float64 4 REASON 5708 non-null object 5 JOB 5681 non-null object 6 YOJ 5445 non-null float64 7 DEROG 5252 non-null float64 8 DELINQ 5380 non-null float64 9 CLAGE 5652 non-null float64 10 NINQ 5450 non-null float64 11 CLNO 5738 non-null float64 12 DEBTINC 4693 non-null float64 dtypes: float64(9), int64(2), object(2) memory usage: 605.4+ KB
Observations:
#Check for duplicates
print(data.nunique())
BAD 2 LOAN 540 MORTDUE 5053 VALUE 5381 REASON 2 JOB 6 YOJ 99 DEROG 11 DELINQ 14 CLAGE 5314 NINQ 16 CLNO 62 DEBTINC 4693 dtype: int64
Observations:
Binary or Categorical Columns:
Numerical Columns with Some Variability:
# count the number of missing (null) values in each column
data.isnull().sum()
BAD 0 LOAN 0 MORTDUE 518 VALUE 112 REASON 252 JOB 279 YOJ 515 DEROG 708 DELINQ 580 CLAGE 308 NINQ 510 CLNO 222 DEBTINC 1267 dtype: int64
Observations:
# View the percentage of missing values for each column
# Calculate the total number of missing values in each column
missing_count = data.isnull().sum()
# Calculate the total number of rows in the dataset
total_rows = data.shape[0]
# Calculate the percentage of missing values for each column
missing_percentage = (missing_count / total_rows) * 100
# Create a DataFrame to display the results
missing_info = pd.DataFrame({
'Missing Count': missing_count,
'Missing Percentage': missing_percentage
})
# Sort the columns by missing percentage (descending order)
missing_info = missing_info.sort_values(by='Missing Percentage', ascending=False)
# Display the missing information
print(missing_info)
Missing Count Missing Percentage DEBTINC 1267 21.258389 DEROG 708 11.879195 DELINQ 580 9.731544 MORTDUE 518 8.691275 YOJ 515 8.640940 NINQ 510 8.557047 CLAGE 308 5.167785 JOB 279 4.681208 REASON 252 4.228188 CLNO 222 3.724832 VALUE 112 1.879195 BAD 0 0.000000 LOAN 0 0.000000
Observations:
# Statement to check for and summarize duplicates
data.duplicated().sum()
0
Obersvations:
# Convert the target variable to an object
data['BAD'] = data['BAD'].astype('object')
# Check data info to ensure BAD was converted from int to obj
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5960 entries, 0 to 5959 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 BAD 5960 non-null object 1 LOAN 5960 non-null int64 2 MORTDUE 5442 non-null float64 3 VALUE 5848 non-null float64 4 REASON 5708 non-null object 5 JOB 5681 non-null object 6 YOJ 5445 non-null float64 7 DEROG 5252 non-null float64 8 DELINQ 5380 non-null float64 9 CLAGE 5652 non-null float64 10 NINQ 5450 non-null float64 11 CLNO 5738 non-null float64 12 DEBTINC 4693 non-null float64 dtypes: float64(9), int64(1), object(3) memory usage: 605.4+ KB
# Creating a statistical summary for numericals
data.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| LOAN | 5960.0 | 18607.969799 | 11207.480417 | 1100.000000 | 11100.000000 | 16300.000000 | 23300.000000 | 89900.000000 |
| MORTDUE | 5442.0 | 73760.817200 | 44457.609458 | 2063.000000 | 46276.000000 | 65019.000000 | 91488.000000 | 399550.000000 |
| VALUE | 5848.0 | 101776.048741 | 57385.775334 | 8000.000000 | 66075.500000 | 89235.500000 | 119824.250000 | 855909.000000 |
| YOJ | 5445.0 | 8.922268 | 7.573982 | 0.000000 | 3.000000 | 7.000000 | 13.000000 | 41.000000 |
| DEROG | 5252.0 | 0.254570 | 0.846047 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 10.000000 |
| DELINQ | 5380.0 | 0.449442 | 1.127266 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 15.000000 |
| CLAGE | 5652.0 | 179.766275 | 85.810092 | 0.000000 | 115.116702 | 173.466667 | 231.562278 | 1168.233561 |
| NINQ | 5450.0 | 1.186055 | 1.728675 | 0.000000 | 0.000000 | 1.000000 | 2.000000 | 17.000000 |
| CLNO | 5738.0 | 21.296096 | 10.138933 | 0.000000 | 15.000000 | 20.000000 | 26.000000 | 71.000000 |
| DEBTINC | 4693.0 | 33.779915 | 8.601746 | 0.524499 | 29.140031 | 34.818262 | 39.003141 | 203.312149 |
**Observations:
Numerical
LOAN:
MORTDUE:
VALUE:
YOJ (Years at Present Job):
DEROG (Number of Major Derogatory Reports):
DELINQ (Number of Delinquent Credit Lines):
CLAGE (Age of Oldest Credit Line in Months): - The age of the oldest credit line varies from 0 months to 1,168.23 months (nearly 97 years).
NINQ (Number of Recent Credit Inquiries):
CLNO (Number of Existing Credit Lines):
DEBTINC (Debt-to-Income Ratio):
# Creating a statistical summary for categoricals
data.describe(include=['object']).T
| count | unique | top | freq | |
|---|---|---|---|---|
| BAD | 5960 | 2 | 0 | 4771 |
| REASON | 5708 | 2 | DebtCon | 3928 |
| JOB | 5681 | 6 | Other | 2388 |
Observations:
Categorical
BAD:
REASON:
JOB:
Leading Questions:
#Plot missing values for further obervation
# Create a dual-axis bar chart
fig, ax1 = plt.subplots(figsize=(12, 6))
#Set the background to white
ax1.set_facecolor('white')
# Bar chart for missing count
ax1.bar(missing_info.index, missing_info['Missing Count'], color='b', alpha=0.7, label='Missing Count')
ax1.set_xlabel('Features')
ax1.set_ylabel('Missing Count', color='blue')
ax1.tick_params(axis='y', labelcolor='blue')
# Twin the x-axis for the second bar chart
ax2 = ax1.twinx()
# Bar chart for missing percentage
ax2.bar(missing_info.index, missing_info['Missing Percentage'], color='r', alpha=0.7, label='Missing Percentage')
ax2.set_ylabel('Missing Percentage (%)', color='red')
ax2.tick_params(axis='y', labelcolor='red')
# Set x-axis labels
plt.xticks(rotation=45)
plt.xlabel('Features')
# Display the legend
fig.legend(loc='upper right', bbox_to_anchor=(0.85, 0.9))
# Show the plot
plt.title('Missing Values Count and Percentage by Feature')
plt.tight_layout()
plt.show()
def univariate_analysis(data, column_name):
"""
Perform univariate analysis on a specified column of a DataFrame.
Parameters:
- data (pd.DataFrame): The DataFrame containing the data.
- column_name (str): The name of the column to analyze.
Returns:
- None (displays plots).
"""
# Set up subplots
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(13, 6))
# Histogram
sns.histplot(data=data, x=column_name, ax=axes[0], kde=True)
axes[0].set_title(f'Histogram of {column_name}')
axes[0].axvline(data[column_name].mean(), color='lightgreen', linestyle='dashed', linewidth=2, label='Mean')
axes[0].axvline(data[column_name].median(), color='red', linestyle='dashed', linewidth=2, label='Median')
axes[0].legend()
# Box plot
sns.boxplot(data=data, x=column_name, ax=axes[1])
axes[1].set_title(f'Box Plot of {column_name}')
axes[1].axvline(data[column_name].mean(), color='lightgreen', linestyle='dashed', linewidth=2, label='Mean')
axes[1].axvline(data[column_name].median(), color='red', linestyle='dashed', linewidth=2, label='Median')
axes[1].legend()
# Adjust layout
plt.tight_layout()
plt.show()
univariate_analysis(data, 'LOAN')
Observations:
univariate_analysis(data, 'MORTDUE')
Observations:
univariate_analysis(data, 'VALUE')
Observatios:
univariate_analysis(data, 'YOJ')
Observations:
univariate_analysis(data, 'DEROG')
Observations:
univariate_analysis(data, 'DELINQ')
Observations:
univariate_analysis(data, 'CLAGE')
Observations:
univariate_analysis(data, 'NINQ')
Observations:
univariate_analysis(data, 'CLNO')
Observations:
univariate_analysis(data, 'DEBTINC')
Observations:
def univariate_categorical_analysis(data, column, show_percentage=True):
# Replace various representations of missing values with NaN
data[column].replace(['nan', 'Nan', 'Blank'], np.nan, inplace=True)
# Calculate value counts, including NaN
value_counts = data[column].value_counts(dropna=False)
# Plot the counts
plt.figure(figsize=(8, 6))
ax = value_counts.plot(kind='bar', rot=45)
ax.set_xlabel(column)
ax.set_ylabel('Count')
# Customize the title based on the column
if column == 'BAD':
ax.set_title('Loan Outcome Distribution')
else:
ax.set_title(f'Count and Percentage Distribution of {column}')
ax.set_xticklabels(value_counts.index, rotation=0)
# Adds class counts on the bars
if column == 'BAD':
ax.set_xticklabels(['Good', 'Bad'], rotation=0)
if show_percentage:
total = len(data)
for p in ax.patches:
height = p.get_height()
ax.annotate(f'{height}\n({height / total:.2%})', (p.get_x() + p.get_width() / 2., height),
ha='center', va='baseline', fontsize=10)
else:
for p in ax.patches:
ax.annotate(f'{p.get_height()}', (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='baseline', fontsize=12)
# Example usage:
categorical_columns = ['BAD', 'REASON', 'JOB']
for column in categorical_columns:
univariate_categorical_analysis(data, column)
plt.show()
Observations:
Count and Percentage distribution of BAD
Count and Percentage distribution of REASON
Count and Percentage distribution of JOB
def visualize_numerical_categorical_relationship(data, numerical_column, categorical_column):
"""
Visualizes the relationship between a numerical variable and a categorical variable using a box plot with annotations.
Parameters:
data (DataFrame): The DataFrame containing the data.
numerical_column (str): The name of the numerical column.
categorical_column (str): The name of the categorical column.
Returns:
None (displays the plot).
"""
plt.figure(figsize=(10, 6))
sns.boxplot(x=categorical_column, y=numerical_column, data=data)
plt.title(f'Relationship between {categorical_column} and {numerical_column}')
plt.xlabel(categorical_column)
plt.ylabel(numerical_column)
# Add annotations (numbers) to the box plot
for cat in data[categorical_column].unique():
box_data = data[data[categorical_column] == cat][numerical_column]
y_pos = box_data.median() + 0.5 # Adjust the vertical position of the annotation
plt.text(data[categorical_column].unique().tolist().index(cat), y_pos,
f'Median: {box_data.median():.2f}\nCount: {box_data.count()}',
ha='center', va='center', fontsize=10, color='black', bbox=dict(facecolor='white', alpha=0.5))
plt.show()
visualize_numerical_categorical_relationship(data, 'LOAN', 'BAD')
Observations:
For the category where "BAD" is 0 (indicating that loans were repaid), the median "LOAN" amount is $14,900, and there are 1,189 cases in this category.
For the category where "BAD" is 1 (indicating that clients defaulted on loans), the median "LOAN" amount is $16,900, and there are 4,771 cases in this category.
visualize_numerical_categorical_relationship(data, 'MORTDUE', 'BAD')
Observations:
For the category where "BAD" is 0 (indicating that loans were repaid), the median "MORTDUE" amount is approximately $60,279, and there are 1,083 cases in this category.
For the category where "BAD" is 1 (indicating that clients defaulted on loans), the median "MORTDUE" amount is approximately $66,839, and there are 4,359 cases in this category.
These insights suggest that, on average, clients who defaulted on loans ("BAD" = 1) tend to have slightly higher mortgage due amounts ("MORTDUE") compared to clients who repaid their loans ("BAD" = 0).
visualize_numerical_categorical_relationship(data, 'YOJ', 'BAD')
Observations:
For the category where "BAD" is 0 (indicating that loans were repaid), the median value for "YOJ" is approximately 6 years, and there are 1,124 cases in this category.
For the category where "BAD" is 1 (indicating that clients defaulted on loans), the median value for "YOJ" is approximately 7 years, and there are 4,321 cases in this category.
These insights suggest that clients who defaulted on loans ("BAD" = 1) tend to have, on average, slightly more years of experience at their present jobs ("YOJ") compared to clients who repaid their loans ("BAD" = 0).
visualize_numerical_categorical_relationship(data, 'DEROG', 'BAD')
Observations:
For the category where "BAD" is 0 (indicating that loans were repaid), the median value for "DEROG" is 0, and there are 1,102 cases in this category.
For the category where "BAD" is 1 (indicating that clients defaulted on loans), the median value for "DEROG" is also 0, and there are 4,150 cases in this category.
These insights suggest that the number of major derogatory reports ("DEROG") does not appear to vary significantly between clients who defaulted on loans ("BAD" = 1) and clients who repaid their loans ("BAD" = 0). Both groups have a median value of 0 for this feature.
visualize_numerical_categorical_relationship(data, 'DELINQ', 'BAD')
Observations:
For the category where "BAD" is 0 (indicating that loans were repaid), the median value for "DELINQ" is 0, and there are 1,117 cases in this category.
For the category where "BAD" is 1 (indicating that clients defaulted on loans), the median value for "DELINQ" is also 0, and there are 4,263 cases in this category.
Similar to the "DEROG" feature, these insights suggest that the number of delinquent credit lines ("DELINQ") does not appear to vary significantly between clients who defaulted on loans ("BAD" = 1) and clients who repaid their loans ("BAD" = 0). Both groups have a median value of 0 for this feature.
visualize_numerical_categorical_relationship(data, 'CLAGE', 'BAD')
Observations:
For the category where "BAD" is 0 (indicating that loans were repaid), the median value for "CLAGE" is approximately 132.87 months, and there are 1,111 cases in this category.
For the category where "BAD" is 1 (indicating that clients defaulted on loans), the median value for "CLAGE" is approximately 180.42 months, and there are 4,541 cases in this category.
These results indicate that there might be a difference in the age of the oldest credit line ("CLAGE") between clients who defaulted on loans ("BAD" = 1) and those who repaid their loans ("BAD" = 0). On average, clients who defaulted on loans tend to have older credit lines than those who repaid their loans. This difference could be an important factor in predicting loan defaults.
visualize_numerical_categorical_relationship(data, 'NINQ', 'BAD')
Observations:
For the category where "BAD" is 0 (indicating that loans were repaid), the median value for "NINQ" is approximately 1, and there are 1,114 cases in this category.
For the category where "BAD" is 1 (indicating that clients defaulted on loans), the median value for "NINQ" is also approximately 1, and there are 4,336 cases in this category.
These results indicate that there doesn't appear to be a significant difference in the number of recent credit inquiries ("NINQ") between clients who defaulted on loans ("BAD" = 1) and those who repaid their loans ("BAD" = 0). The median value of "NINQ" is the same for both categories, suggesting that this feature might not be a strong predictor of loan defaults.
visualize_numerical_categorical_relationship(data, 'CLNO', 'BAD')
Observations:
For the category where "BAD" is 0 (indicating that loans were repaid), the median value for "CLNO" is approximately 20, and there are 1,136 cases in this category.
For the category where "BAD" is 1 (indicating that clients defaulted on loans), the median value for "CLNO" is also approximately 20, and there are 4,602 cases in this category.
These results indicate that there doesn't appear to be a significant difference in the number of existing credit lines ("CLNO") between clients who defaulted on loans ("BAD" = 1) and those who repaid their loans ("BAD" = 0). The median value of "CLNO" is the same for both categories, suggesting that this feature might not be a strong predictor of loan defaults.
visualize_numerical_categorical_relationship(data, 'DEBTINC', 'BAD')
Observations:
For the category where "BAD" is 0 (indicating that loans were repaid), the median value for "DEBTINC" is approximately 38.08, and there are 403 cases in this category.
For the category where "BAD" is 1 (indicating that clients defaulted on loans), the median value for "DEBTINC" is approximately 34.54, and there are 4,290 cases in this category.
These results indicate that there may be some variation in the "DEBTINC" values between clients who defaulted on loans ("BAD" = 1) and those who repaid their loans ("BAD" = 0). Specifically, clients who defaulted on loans tend to have a lower median debt-to-income ratio ("DEBTINC") compared to those who repaid their loans. This suggests that the "DEBTINC" feature may have some predictive power in determining loan defaults.
def create_stacked_bar_plot_with_labels(data, x, y):
"""
Print the category counts, plot a stacked bar chart, and add legends.
Parameters:
- data: DataFrame
The DataFrame containing the data.
- x: str
The target variable.
- y: str
The categorical variable.
Returns:
- None (displays the stacked bar chart).
"""
# Calculate the number of unique categories in the categorical variable
count = data[y].nunique()
# Sort the target variable by counts and create a cross-tabulation
sorter = data[x].value_counts().index[-1]
tab1 = pd.crosstab(data[y], data[x], margins=True).sort_values(
by=sorter, ascending=False
)
# Print the category counts
print(tab1)
print("-" * 120)
# Create a cross-tabulation with percentage values
tab = pd.crosstab(data[y], data[x], normalize="index").sort_values(
by=sorter, ascending=False
)
# Plot a stacked bar chart
tab.plot(kind="bar", stacked=True, figsize=(count + 1, 5))
# Add legends
plt.legend(loc="lower left", frameon=False)
plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
plt.title(f'Distribution of {y} by {x}')
# Display the chart
plt.show()
create_stacked_bar_plot_with_labels(data, 'BAD', 'REASON')
BAD 0 1 All REASON All 4567 1141 5708 DebtCon 3183 745 3928 HomeImp 1384 396 1780 ------------------------------------------------------------------------------------------------------------------------
Observations:
create_stacked_bar_plot_with_labels(data, 'BAD', 'JOB')
BAD 0 1 All JOB All 4515 1166 5681 Other 1834 554 2388 ProfExe 1064 212 1276 Mgr 588 179 767 Office 823 125 948 Self 135 58 193 Sales 71 38 109 ------------------------------------------------------------------------------------------------------------------------
Observations:
# Calculate the correlation matrix excluding the 'BAD' column
org_correlation_matrix = data.drop(columns=['BAD']).corr()
# Create a heat map
plt.figure(figsize=(10, 8))
sns.heatmap(org_correlation_matrix, vmax=1, vmin=-1, annot=True, cmap='coolwarm',
linewidth=.5)
plt.title("Correlation Matrix Heatmap (Excluding BAD)")
plt.show()
# Display the correlation matrix (excluding BAD)
# org_correlation_matrix
Observations:
"VALUE" and "MORTDUE" have a strong positive correlation of approximately 0.88. This suggests that the current value of the property ("VALUE") and the amount due on the existing mortgage ("MORTDUE") are closely related, which is expected.
"CLAGE" (age of the oldest credit line) has a moderate positive correlation with "CLNO" (number of existing credit lines) and "VALUE." This indicates that older credit lines tend to be associated with a higher number of existing credit lines and higher property values.
"DEROG" (number of major derogatory reports) has a moderate positive correlation with "DELINQ" (number of delinquent credit lines). This suggests that individuals with more major derogatory reports also tend to have more delinquent credit lines.
"NINQ" (number of recent credit inquiries) and "DEBTINC" (debt-to-income ratio) have a weak positive correlation with several other features, indicating mild relationships.
"YOJ" (years at present job) has weak correlations with most other features, suggesting that it may not have strong linear relationships with other variables.
"LOAN" (amount of loan approved) has weak correlations with most other features, indicating that the loan amount approved may not be strongly correlated with other numerical variables.
# Create a pair plot
sns.pairplot(data, diag_kind='kde', markers='o', hue='BAD')
plt.show()
Observations:
def treat_outliers(df,col):
'''
treats outliers in a varaible
col: str, name of the numerical varaible
df: data frame
col: name of the column
'''
Q1=data[col].quantile(0.25) # 25th quantile
Q3=data[col].quantile(0.75) # 75th quantile
IQR=Q3-Q1 # IQR Range
Lower_Whisker = Q1 - 1.5 * IQR #define lower whisker
Upper_Whisker = Q3 + 1.5 * IQR # define upper Whisker
df[col] = np.clip(df[col], Lower_Whisker, Upper_Whisker) # all the values samller than Lower_Whisker will be assigned value of Lower_whisker
# and all the values above upper_whishker will be assigned value of upper_Whisker
return df
def treat_outliers_all(df, col_list):
'''
Treat outliers in all numerical variables except the 'BAD' column.
col_list: list of numerical variables
df: data frame
'''
for c in col_list:
if c != 'BAD': # Skip the 'BAD' column
df = treat_outliers(df, c)
return df
df_raw = data.copy()
numerical_col = df_raw.select_dtypes(include=np.number).columns.tolist()# getting list of numerical columns
df = treat_outliers_all(df_raw,numerical_col)
# Function to create a binary flag for missing values in a column
def create_missing_value_flag(df, col):
"""
Create a binary flag for missing values in the specified column.
Parameters:
df (DataFrame): The DataFrame containing the data.
col (str): The name of the column with missing values.
Returns:
df (DataFrame): The DataFrame with a binary flag for missing values in the specified column.
"""
new_col_name = str(col)
new_col_name = col + '_missing_flag'
df[new_col_name] = df[col].isna().astype(int)
return df
# List of columns with missing values
columns_with_missing_values = [col for col in df.columns if df[col].isnull().any()]
# Create binary flags for missing values in the specified columns
for column in columns_with_missing_values:
df = create_missing_value_flag(df, column)
# Treat Missing values in numerical columns with median and mode in categorical variables
# Select numeric columns.
num_data = df.select_dtypes('number')
# Select string and object columns.
cat_data = df.select_dtypes('object').columns.tolist()#df.select_dtypes('object')
# Fill numeric columns with median.
# Remove _________ and complete the code
df[num_data.columns] = num_data.fillna(num_data.median())
# Fill object columns with model.
# Remove _________ and complete the code
for column in cat_data:
mode = df[column].mode()[0]
df[column] = df[column].fillna(mode)
What are the the most important observations and insights from the data based on the EDA performed?
# Data Preprocessing and Splitting
# Removing the dependent variable
X = df.drop(["BAD"], axis=1)
# Defining the dependent (target) variable
Y = df['BAD']
# Get dummies for X
X = pd.get_dummies(X, columns=['REASON', 'JOB'], drop_first=True)
# Splitting the data in 70:30 ratio for train to test data
X_train, X_test, y_train, y_test = train_test_split(
X, Y, test_size=0.30, random_state=1
)
# Checking the shape of the train and test data
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Shape of Training set : (4172, 27) Shape of test set : (1788, 27) Percentage of classes in training set: 0 0.804171 1 0.195829 Name: BAD, dtype: float64 Percentage of classes in test set: 0 0.791946 1 0.208054 Name: BAD, dtype: float64
def classification_metrics(y_true, y_pred, class_labels=None):
"""
Display the classification report and confusion matrix.
Parameters:
y_true (array-like): True labels.
y_pred (array-like): Predicted labels.
class_labels (list, optional): List of class labels for display. If not provided, it will be inferred.
Returns:
None
"""
if class_labels is None:
class_labels = sorted(set(y_true)) # Infer class labels from y_true
# Map binary labels to meaningful labels
class_labels = ["Loan Repaid", "Defaulted on Loan"]
# Print the classification report
report = classification_report(y_true, y_pred, target_names=class_labels)
print("Classification Report:\n", report)
# Compute the confusion matrix
cm = confusion_matrix(y_true, y_pred)
# Create a heatmap for the confusion matrix
plt.figure(figsize=(8, 5))
sns.heatmap(cm, annot=True, fmt=".2f", xticklabels=class_labels, yticklabels=class_labels)
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.title('Confusion Matrix')
plt.show()
def model_performance_classification(model, predictors, target):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
"""
# Predicting using the independent variables
pred = model.predict(predictors)
precision = round(precision_score(target, pred, average='macro'), 2) # Precision score
recall = round(recall_score(target, pred, average='macro'), 2) # Recall score
accuracy = round(accuracy_score(target, pred), 2) # Accuracy score
# Creating a dictionary of metrics
metrics_dict = {
"Precision": precision,
"Recall": recall,
"Accuracy": accuracy,
}
return metrics_dict
# Create a Logistic Regression model
logistic_model = LogisticRegression(random_state=1)
# Fit the model on the training data
logistic_model.fit(X_train, y_train)
# Predict on the training set
y_train_pred_log_reg = logistic_model.predict(X_train)
# Evaluate the performance on the training set
classification_metrics(y_train, y_train_pred_log_reg)
Classification Report:
precision recall f1-score support
Loan Repaid 0.81 1.00 0.89 3355
Defaulted on Loan 0.69 0.02 0.05 817
accuracy 0.81 4172
macro avg 0.75 0.51 0.47 4172
weighted avg 0.78 0.81 0.73 4172
Observations:
# Predict on the test set
y_test_pred_log_reg = logistic_model.predict(X_test)
# Evaluate the performance on the test set
classification_metrics(y_test, y_test_pred_log_reg)
Classification Report:
precision recall f1-score support
Loan Repaid 0.79 1.00 0.88 1416
Defaulted on Loan 0.43 0.01 0.02 372
accuracy 0.79 1788
macro avg 0.61 0.50 0.45 1788
weighted avg 0.72 0.79 0.70 1788
logistic_regression_test = model_performance_classification(logistic_model, X_test, y_test)
logistic_regression_test
{'Precision': 0.61, 'Recall': 0.5, 'Accuracy': 0.79}
Observations:
# Building the model
decision_tree = DecisionTreeClassifier(class_weight = {0: 0.17, 1: 0.83}, random_state = 1)
# Fitting the decision tree model
decision_tree.fit(X_train, y_train)
DecisionTreeClassifier(class_weight={0: 0.17, 1: 0.83}, random_state=1)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. DecisionTreeClassifier(class_weight={0: 0.17, 1: 0.83}, random_state=1)# Checking performance on the training dataset
y_train_pred_decision_tree = decision_tree.predict(X_train)
classification_metrics(y_train, y_train_pred_decision_tree)
Classification Report:
precision recall f1-score support
Loan Repaid 1.00 1.00 1.00 3355
Defaulted on Loan 1.00 1.00 1.00 817
accuracy 1.00 4172
macro avg 1.00 1.00 1.00 4172
weighted avg 1.00 1.00 1.00 4172
Observations:
# Checking performance on the test dataset
y_test_pred_decision_tree = decision_tree.predict(X_test)
classification_metrics(y_test, y_test_pred_decision_tree)
Classification Report:
precision recall f1-score support
Loan Repaid 0.89 0.94 0.91 1416
Defaulted on Loan 0.70 0.57 0.63 372
accuracy 0.86 1788
macro avg 0.80 0.75 0.77 1788
weighted avg 0.85 0.86 0.85 1788
decision_tree_test = model_performance_classification(decision_tree, X_test, y_test)
decision_tree_test
{'Precision': 0.8, 'Recall': 0.75, 'Accuracy': 0.86}
Observations:
# Calculate the feature importances
important_features = decision_tree.feature_importances_
# Storing the names of all features
columns = X.columns
# Creating a data frame to store the calculated feature importances
important_features_df = pd.DataFrame(important_features, index = columns, columns= ['Importance']).sort_values(by = 'Importance', ascending = False)
# Display the feature importances
sns.barplot(x = important_features_df.Importance, y = important_features_df.index)
plt.title('Feature Importances')
Text(0.5, 1.0, 'Feature Importances')
Observations:
Criterion {“gini”, “entropy”}
The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain.
max_depth
The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
min_samples_leaf
The minimum number of samples is required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.
You can learn about more Hyperpapameters on this link and try to tune them.
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
# Selecting the classifier
decision_tree_tuned = DecisionTreeClassifier(class_weight = {0: 0.3, 1: 0.7}, random_state= 1)
#decision_tree = DecisionTreeClassifier(class_weight = {0: 0.17, 1: 0.83}, random_state = 1)
# Grid of parameters to choose from
parameters = {'max_depth': np.arange(2, 10),
'criterion': ['gini', 'entropy'],
'min_samples_leaf': [5, 10, 20, 25]
}
# Type of scoring used to compare parameter combinations - recall score for class 1
scorer = metrics.make_scorer(recall_score, pos_label = 1)
# Running the grid search
gridCV = GridSearchCV(decision_tree_tuned, parameters, scoring = scorer, cv = 10)
# Fitting the grid search on training data
gridCV = gridCV.fit(X_train, y_train)
# Using the best combinations of parameters on the classifier
decision_tree_tuned = gridCV.best_estimator_
# Fitting the best estimator to the training data
decision_tree_tuned .fit(X_train, y_train)
DecisionTreeClassifier(class_weight={0: 0.3, 1: 0.7}, max_depth=4,
min_samples_leaf=5, random_state=1)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. DecisionTreeClassifier(class_weight={0: 0.3, 1: 0.7}, max_depth=4,
min_samples_leaf=5, random_state=1)# Checking performance on the training data
y_train_pred_decision_tree_tuned = decision_tree_tuned.predict(X_train)
classification_metrics(y_train, y_train_pred_decision_tree_tuned)
Classification Report:
precision recall f1-score support
Loan Repaid 0.94 0.90 0.92 3355
Defaulted on Loan 0.65 0.76 0.70 817
accuracy 0.87 4172
macro avg 0.79 0.83 0.81 4172
weighted avg 0.88 0.87 0.88 4172
Observations:
# Checking performance on the training data
y_test_pred_decision_tree_tuned = decision_tree_tuned.predict(X_test)
classification_metrics(y_test, y_test_pred_decision_tree_tuned)
Classification Report:
precision recall f1-score support
Loan Repaid 0.93 0.90 0.92 1416
Defaulted on Loan 0.67 0.73 0.70 372
accuracy 0.87 1788
macro avg 0.80 0.82 0.81 1788
weighted avg 0.87 0.87 0.87 1788
Observations:
decision_tree_tuned_test = model_performance_classification(decision_tree_tuned, X_test, y_test)
decision_tree_tuned_test
{'Precision': 0.8, 'Recall': 0.82, 'Accuracy': 0.87}
# Calculate the feature importances
important_features = decision_tree_tuned.feature_importances_
# Storing the names of all features
columns = X.columns
# Creating a data frame to store the calculated feature importances
important_features_df = pd.DataFrame(important_features, index = columns, columns= ['Importance']).sort_values(by = 'Importance', ascending = False)
# Display the feature importances
sns.barplot(x = important_features_df.Importance, y = important_features_df.index)
plt.title('Feature Importances')
Text(0.5, 1.0, 'Feature Importances')
Observations:
# Displaying the decision tree
features = list(X.columns)
class_names = ["Loan Repaid", "Defaulted on Loan"]
plt.figure(figsize = (40, 15))
tree.plot_tree(decision_tree_tuned, max_depth = 4, feature_names = features, filled = True, fontsize = 12, node_ids = True, class_names = class_names)
plt.show()
Observations:
Here are the key points and insights from the decision tree analysis:
Initial Decision: The decision tree's classification begins by assessing the presence of the DEBTINC_missing_flag. This initial decision is crucial in predicting loan defaults.
Node #0: The root node splits the data into two groups based on whether the DEBTINC_missing_flag is present or absent. The majority class at this node is "Loan Repaid," indicating that a significant number of instances fall into this category.
Node #1: Within the group where DEBTINC_missing_flag is less than or equal to 0.5, the tree further examines the DEBTINC feature. When DEBTINC is less than or equal to 43.993, the majority class remains "Loan Repaid."
Node #2: In the subset where DEBTINC_missing_flag is 0 and DEBTINC is less than or equal to 43.993, the model explores the presence of the VALUE_missing_flag. The majority class remains "Loan Repaid."
Node #3: For instances with DEBTINC_missing_flag equal to 0, DEBTINC less than or equal to 43.993, and VALUE_missing_flag less than or equal to 0.5, the tree assesses the CLAGE feature. Again, the majority class is "Loan Repaid."
Node #6: A significant decision point occurs where instances meet the criteria of YOJ_missing_flag being less than or equal to 0.5. These instances are classified as "Defaulted on Loan."
Node #8: Within the "Defaulted on Loan" class, a small subset meets certain criteria and is classified as "Loan Repaid."
Node #13: Within this node, some instances are classified as "Loan Repaid."
Node #20: Another scenario arises where instances are classified as "Loan Repaid" within this node.
Node #28: In this node, certain instances are classified as "Loan Repaid."
Overall, the decision tree employs a series of criteria to make predictions about loan repayment and defaults. It starts with the presence of DEBTINC_missing_flag and further investigates various features like DEBTINC, VALUE_missing_flag, CLAGE, YOJ_missing_flag, NINQ, VALUE, and JOB_missing_flag. At each node, it assigns instances to the most likely class, either "Loan Repaid" or "Defaulted on Loan," based on the Gini impurity measure. Nodes with lower impurity are more confident in their predictions, while higher impurity nodes may have greater uncertainty. The tree identifies specific scenarios and criteria that contribute to these classifications, providing insights into loan default prediction.
Random Forest is a bagging algorithm where the base models are Decision Trees. Samples are taken from the training data and on each sample a decision tree makes a prediction.
The results from all the decision trees are combined together and the final prediction is made using voting or averaging.
# Fitting the random forest tree classifier on the training data
rf_estimator = RandomForestClassifier(class_weight = {0: 0.17, 1: 0.83}, random_state = 1)
rf_estimator.fit(X_train, y_train)
RandomForestClassifier(class_weight={0: 0.17, 1: 0.83}, random_state=1)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. RandomForestClassifier(class_weight={0: 0.17, 1: 0.83}, random_state=1)# Checking performance on the training data
y_pred_train_rf = rf_estimator.predict(X_train)
classification_metrics(y_train, y_pred_train_rf)
Classification Report:
precision recall f1-score support
Loan Repaid 1.00 1.00 1.00 3355
Defaulted on Loan 1.00 1.00 1.00 817
accuracy 1.00 4172
macro avg 1.00 1.00 1.00 4172
weighted avg 1.00 1.00 1.00 4172
Observations:
The training results for the Random Forest model are quite remarkable. With precision, recall, and F1-scores all at 1.00, it suggests that the model is performing exceptionally well in classifying both "Loan Repaid" and "Defaulted on Loan" instances in the training data.
This level of performance typically raises questions about the possibility of overfitting, especially when the model's performance on the training data is perfect.
# Checking performance on the testing data
y_pred_test_rf = rf_estimator.predict(X_test)
classification_metrics(y_test, y_pred_test_rf)
Classification Report:
precision recall f1-score support
Loan Repaid 0.91 0.96 0.93 1416
Defaulted on Loan 0.82 0.62 0.71 372
accuracy 0.89 1788
macro avg 0.86 0.79 0.82 1788
weighted avg 0.89 0.89 0.89 1788
rf_estimator_test = model_performance_classification(rf_estimator, X_test, y_test)
rf_estimator_test
{'Precision': 0.86, 'Recall': 0.79, 'Accuracy': 0.89}
Observations:
# Calculate the feature importances
important_features = rf_estimator.feature_importances_
# Storing the names of all features
columns = X.columns
# Creating a data frame to store the calculated feature importances
important_features_df = pd.DataFrame(important_features, index = columns, columns= ['Importance']).sort_values(by = 'Importance', ascending = False)
# Display the feature importances
sns.barplot(x = important_features_df.Importance, y = important_features_df.index)
plt.title('Feature Importances')
Text(0.5, 1.0, 'Feature Importances')
Observations:
This feature's importance suggests that the actual debt-to-income ratio the most important factor in predicting loan default.
Similar to the tuned decision tree, the presence of this DEBTINC_missing_flag indicates that the model has learned to consider whether DEBTINC is missing as an important piece of information in predicting loan default.
# Selecting the classifier - Random Forest
rf_estimator_tuned = RandomForestClassifier(class_weight={0: 0.3, 1: 0.7}, random_state=1)
#decision_tree_tuned = DecisionTreeClassifier(class_weight = {0: 0.3, 1: 0.7}, random_state= 1)
# Grid of parameters to choose from
parameters = {
#'n_estimators': [100, 200, 300],
'max_depth': [None, 10, 20, 30],
#'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4],
'criterion': ['gini', 'entropy']
}
# Type of scoring used to compare parameter combinations - recall score for class 1
scorer = metrics.make_scorer(recall_score, pos_label=1)
# Running the grid search
gridCV = GridSearchCV(rf_estimator_tuned, parameters, scoring=scorer, cv=10)
# Fitting the grid search on training data
gridCV = gridCV.fit(X_train, y_train)
# Using the best combinations of parameters on the classifier
rf_estimator_tuned = gridCV.best_estimator_
# Fitting the best estimator to the training data
rf_estimator_tuned.fit(X_train, y_train)
RandomForestClassifier(class_weight={0: 0.3, 1: 0.7}, max_depth=10,
min_samples_leaf=2, random_state=1)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. RandomForestClassifier(class_weight={0: 0.3, 1: 0.7}, max_depth=10,
min_samples_leaf=2, random_state=1)# Checking performance on the training data
y_pred_train_rf_tuned = rf_estimator_tuned.predict(X_train)
classification_metrics(y_train, y_pred_train_rf_tuned)
Classification Report:
precision recall f1-score support
Loan Repaid 0.96 0.93 0.94 3355
Defaulted on Loan 0.75 0.82 0.78 817
accuracy 0.91 4172
macro avg 0.85 0.88 0.86 4172
weighted avg 0.91 0.91 0.91 4172
Observations:
# Checking performance on the training data
y_pred_test_rf_tuned = rf_estimator_tuned.predict(X_test)
classification_metrics(y_test, y_pred_test_rf_tuned)
Classification Report:
precision recall f1-score support
Loan Repaid 0.93 0.93 0.93 1416
Defaulted on Loan 0.72 0.73 0.73 372
accuracy 0.89 1788
macro avg 0.83 0.83 0.83 1788
weighted avg 0.89 0.89 0.89 1788
rf_estimator_tuned_test = model_performance_classification(rf_estimator_tuned, X_test, y_test)
rf_estimator_tuned_test
{'Precision': 0.83, 'Recall': 0.83, 'Accuracy': 0.89}
Observations:
# Create a dictionary of models
models = {
'logistic_model': logistic_model,
'decision_tree': decision_tree,
'decision_tree_tuned': decision_tree_tuned,
'rf_estimator': rf_estimator,
'rf_estimator_tuned': rf_estimator_tuned
}
# Create an empty list to store the results
results = []
# Iterate through the models and compute the performance metrics
for model_name, model in models.items(): # Use .items() to iterate through both keys and values
metrics = model_performance_classification(model, X_test, y_test)
results.append({'Model': model_name, **metrics})
# Create a DataFrame from the results
comparing_test_models_df = pd.DataFrame(results)
# Set the 'Model' column as the index
comparing_test_models_df.set_index('Model', inplace=True)
# Display the DataFrame
print(comparing_test_models_df)
Precision Recall Accuracy Model logistic_model 0.61 0.50 0.79 decision_tree 0.80 0.75 0.86 decision_tree_tuned 0.80 0.82 0.87 rf_estimator 0.86 0.79 0.89 rf_estimator_tuned 0.83 0.83 0.89
1. Comparison of various techniques and their relative performance based on chosen Metric (Measure of success):
2. Refined insights:
3. Proposal for the final solution design:
What model do you propose to be adopted? Why is this the best solution to adopt?
For the final solution, I propose adopting an ensemble learning approach, specifically the "rf_estimator_tuned" model. Random forest models have proven to be strong performers in this classification problem, achieving an accuracy of 0.89, with potential for further improvement. The "tuned" version offers a balanced performance between precision and recall (0.83), making it a reliable choice for predicting loan defaults.
The rationale for adopting this model is its ability to handle complex relationships within the data, providing robust predictions while controlling overfitting. The tuning process further enhances its performance. Additionally, the random forest model can handle both categorical and numerical features effectively. To achieve even better results, further data preprocessing and feature engineering could be explored. The final solution design should include ongoing model monitoring and retraining to adapt to changing patterns and improve its performance over time.
Title: Optimizing Lending Decisions with Advanced Data Analysis
Introduction: In a dynamic financial landscape, the accuracy of lending decisions is paramount. This project focuses on harnessing data analysis and predictive modeling to enhance the precision of lending decisions and mitigate loan default risks, ultimately aiming for deployment in the organization.
Problem Statement: For the organization, minimizing loan defaults and maximizing lending profits are imperative. Loan classification accuracy directly impacts profitability, and our previous standard practices yielded an 80% positive return. However, this project's aim is to further improve the accuracy of loan classifications, specifically predicting whether a loan will be repaid or defaulted.
Methodology: Several advanced level machine learning models were leveraged, including Logistic Regression, Decision Trees, and Random Forest, each meticulously evaluated with a focus on precision, recall, accuracy, and the F1 score.
Key Findings: The initial analysis unveiled several opportunities for improvement. Missing data in certain columns and data cleaning requirements were identified. Additionally, several numerical variables displayed significant variability, while specific categorical variables like "REASON" and "JOB" emerged as potential contributors to loan outcomes.
Solution Design: The final proposed solution design is underpinned by the tuned robust Random Forest model. This model combines accuracy, interpretability, and robustness, capitalizing on essential factors such as credit history, debt-to-income ratios, and property values.
Next Steps:
Conclusion: This project equips the organization with a powerful tool to minimize loan default risks and improve the precision of lending decisions. The final model, the tuned Random Forest algorithm, showcases impressive results, with an accuracy of 89%, a precision of 83%, and a recall of 83%. Additionally, the tuned random forest model demonstrates an F1 score of 0.73 for identifying 'Defaulted on Loan' cases. These metrics reflect a significant enhancement in loan classification accuracy, positioning the organization for success in an ever-evolving financial landscape and emphasizing the model's contribution to our goals."
For the organization, minimizing loan defaults and maximizing lending profits are imperative. Loan classification accuracy directly impacts profitability, and the previous standard practices yielded an 80% positive return. However, this project's aim is to further improve the accuracy of loan classifications, specifically predicting whether a loan will be repaid or defaulted, paving the way for model deployment.
The proposed solution design encompasses several key elements:
Data Cleaning: Addressing missing values and data type issues is a crucial initial step. Clean, complete data is essential for building a reliable model.**
Model Building: The focus was on fine-tuning the Random Forest classification model, which has demonstrated its effectiveness in predicting loan outcomes.
Model Evaluation: The final model's performance is rigorously evaluated using a test dataset. This ensures that the model's predictions are accurate and dependable.
The proposed solution design is an excellent fit for the problem at hand for several reasons:
Accuracy and Risk Mitigation: By deploying this model, the aim is to enhance the precision of loan classification. This will significantly reduce the risk of approving loans that may eventually lead to defaults.
Efficiency: An efficient model will streamline the loan approval process, making it faster and more accurate. This leads to improved customer satisfaction and business performance.
Key Steps Taken: have rigorously followed the steps of data cleaning, model building, and model evaluation to ensure that our solution meets the highest standards of quality and reliability.
Business Impact: Implementing this model will lead to more informed lending decisions, ultimately affecting the business positively by reducing financial risks and potential losses.
Regarding model outcomes: The tuned random forest model demonstrated an approval rate of approximately 83% on loan applications, indicating that it accurately approved 83% of loan applications. This high approval rate signifies the model's cautious yet efficient approach to lending decisions, which directly contributes to more precise loan classifications and, consequently, enhanced lending practices.
False Positives and False Negatives: In the tuned Random Forest model, it was observed that 105 loans were incorrectly classified as defaulted on loan when, in reality, they were repaid. Additionally, 273 loans were incorrectly categorized as repaid when they had indeed defaulted on their loans. These results highlight the significance of managing and minimizing such errors, which have implications for risk assessment and financial decision-making. Managing these errors effectively is vital for the lending process's overall performance and reliability.
Comprehensive Data Monitoring: Implement a real-time data monitoring system to ensure ongoing data accuracy and relevancy. This will enable the model to adapt to evolving market conditions and customer behaviors effectively.
Automation of Decision Processes: Develop automated decision processes to streamline loan approval procedures. This will reduce manual effort, enhance efficiency, and ensure consistent and fair lending decisions.
Alternative Machine Learning Approaches: Continuously explore and experiment with alternative machine learning algorithms to enhance predictive accuracy. Diversify model development to adapt to changing customer demographics and lending trends.
Key Actionables for Stakeholders: Invest in Data Quality: Allocate resources and budgets to maintain data quality and consistency. Establish data governance processes to prevent data degradation and maintain model effectiveness.
Staff Training and Development: Provide staff with ongoing training and development opportunities to understand and leverage the model effectively. Ensuring that personnel are proficient in interpreting and using the model is vital.
Regular Model Audits: Conduct periodic model audits to verify its accuracy, fairness, and adherence to regulatory guidelines. Collaborate with data scientists and model developers to maintain model integrity.
Expected Benefits and Costs:
Benefits:
Improved Loan Decision Accuracy: The model is expected to increase loan approval precision by 3%, resulting in reduced risk and higher return on approved loans.
The 3% improvement in positive return, based on the average loan amount, is worth approximately $3,340,192.
Efficiency Gains: Automated processes will lead to operational cost savings and quicker decision-making, reducing costs while maintaining or improving customer satisfaction.
Reduced Default Rates: The model will mitigate loan default risks, leading to lower provisions for bad loans and boosting overall profitability.
Costs:
Initial Implementation Costs: Invest in technology, infrastructure, and personnel training.
Ongoing Monitoring and Maintenance: Allocate resources for ongoing data monitoring, system maintenance, and staff training.
Key Risks and Challenges:
Data Security and Privacy: Safeguard customer data to ensure compliance with data protection regulations and to maintain trust among customers.
Model Fairness and Bias: Regularly assess model fairness to avoid unintended discrimination and bias. Continuously fine-tune the model to minimize these risks.
Evolving Market Dynamics: Keep a close eye on changing market dynamics and customer preferences to ensure that the model remains relevant.
Further Analysis and Problem Solving:
Market Segmentation: Conduct in-depth market segmentation analysis to tailor lending decisions for different customer segments more accurately.
Enhancing Explainability: Invest in model explainability techniques to provide clearer justification for approval and rejection decisions, enhancing transparency and trust.
Predictive Model Integration: The model has demonstrated promising results, including a 3% improvement over the current lending process, signifying its potential to enhance loan decision accuracy. While the tuned Random Forest model demonstrates its readiness for immediate deployment with an accuracy of 89%, a precision of 83%, a recall of 83%, and a F1 score of 73%, we also recognize the importance of a phased approach. The positive outcomes justify further exploration and integration into various banking processes, ensuring a seamless transition. This step will allow for an assessment of the model's impact, fine-tuning of its performance, and addressing any potential challenges while maintaining a robust and cautious deployment strategy.